Project Set Up

Prior to data analysis and visualization, the following code was run to set up the R libraries and datasets required for this project.

##R libraries need for data manipulation and visualization
library(tidyverse)
library(ggcorrplot)
library(reshape2)
library(plotly)
library(DescTools)

##R datasets used in this project
data(msleep)
data(attitude)
data(Seatbelts)
data(WorldPhones)

I. Relationship Between Mammal Conservation Status and Sleep Behavior

Background

The dataset msleep comes from a study conducted by V. M. Savage and G. B. West on a group of mammals and their sleep time. There are 83 observation and 11 variables in this dataset, with each row representing a particular mammal. Variables in the dataset record the common name, genus, diet type, order, conservation status, total hours of sleep, hours of REM sleep, length of sleep cycle, hours awake, and body and brain weight in kilograms of each mammal.

The question we are hoping to explore is whether there exists differences in sleep behavior between mammals of different conservation status. We further categorized conservation status into the categories “lower risk”" and “threatened”" based on the definition provided in Wikipedia. We looked at sleep behavior by the percentage of hours in REM sleep over total hours of sleep.

Data Manipulation

We performed the following manipulations to create new variables that categorize conservation status and characterize sleep behavior, filter out observation with missing values, and produce visualization to explore our question of interest.

##Transfrom dataset to tibble, create variable risk_level that further categorizes conservation status, rename levels in the variable conservation, and variable that records the percentage of REM sleep over total hours of sleep
msleep_tib <-msleep %>% 
  as_tibble %>% 
  mutate(
  risk_level = ifelse(conservation %in% c("en", "vu"), "Threatened", 
                      ifelse(conservation %in% c("cd","domesticated","lc", "nt"),"Lower Risk", "Other")),
  conservation = plyr::revalue(conservation,c("en"="endangered", "vu"="vulnerable","cd"="conservation dependent","lc"="least concern","nt"="near threatened")),
  perc_rem = (sleep_rem/sleep_total)*100
  )

##filter out observations without conservation status
msleep_tib_for_plot <- msleep_tib %>% filter(!is.na(conservation))

##create plotly box plot that displays the distribution of percentage REM sleep grouped by conservation status and risk level
plot_perc_rem <- plot_ly(msleep_tib_for_plot, x = ~risk_level, y = ~perc_rem, color = ~conservation, type = "box") %>%
  layout(title="Percentage of REM Sleep in Mammals Across Conservation Status",
    xaxis=list(title="Conservation Status Risk Level"),
    yaxis = list(title="% REM Sleep"),
    boxmode = "group")
plot_perc_rem

Summary

The figure above visualizes the distribution of percentages of REM sleep in mammals. Each boxplot represents the distribution of REM sleep for mammals in the study of a particulare conservation status, which can be referenced by the colors in the legend. The boxplots are further grouped into the two risk levels ,“Lower Risk” and “”Threatened" for easier comparison. The upper and lower edge of each box represents 25th to 75th quantile and the center line marks the median of the distribution, whereas the whiskers reach out to the minimum and maximum values.

Since the boxs of the boxplots mostly overlap with each other, it seems that the distributions of REM sleep behavior are fairly similar among mammals of different conservation status. Some patterns worthy of note are that mammals with status “conservation dependent” have the lowest median and 25th quantile in percentage of REM sleep (12.8% and 3.7% respectively), mammals in “least concern” status have the widest distribution in percentage of REM sleep ranging from 7% to 4.7%, and the distribution of mammals that are “endangered” is higher than those of all other groups, with its 25th quantile, median, and 75th quantile being 21.7%,7.7%, and 33.7%.

It does not seem that we can reach any definite conclusion in how sleep behavior in terms of percentage of REM sleep differ between mammals across conservation statuts. It does seem that endangered mammals might have higher percentages of REM sleep from looking at this dataset. However, since the sample size for endangered mammals is small (only 4), it would be interesting to investigate further whether these differences in distribution hold if each conservation status corresponds to equal number of mammals. There might also be confounding factors that are more directly related to REM sleep, such as brain weight, that should be investigated.

II. Relationship Between Employee Questionnaire Response and Approval Rating

Background

The dataset attitude is taken from questionnaire responses of clerical employees at a financial organization. There are 30 observations and 7 variables in this dataset, corresponding to 30 departments of the organization and responses to the questions in the survey. The questionaire hoped to gauge employee perception on the company through questions related to positive opportunities and the management of negative factors in the work place. The questions asked touched on the following aspects: whether employee complaints are handled appropriately, whether or not special privileges are allowed, whether there are learning opportunities, whether there are performance-based raises, whether management is too critical of employee, and whether there is room for advancement. Thirty departments were randomly selected to participate and employee responses were aggregated into percentage of favourable response for each question as well as an overall rating for the company.

The question we hope to explore from this dataset is whether there is correlation between overall approval rating and the percentage of favourable response on each of the 6 questions. Are the relationships stronger for certain questions than others?

Data Manipulation

We performed the following manipulations to restructure the data, create a new variable, and produce visualization to explore our question of interest.

##make dataset a tibble, restructure it from wide to long, and create a variable that group questions to either related to positive of negative factors in the workplace
attitude_tib <- attitude %>% 
  as_tibble() %>% 
  melt(id.vars=c("rating"), measure.vars=c("complaints", "privileges", "learning","raises", "critical","advance")) %>% 
  mutate(question_category = ifelse(variable %in% c("complaints", "privileges","critical"), "Negative Factors","Positive Factors"))

##create a scatterplot with regression lines of overall rating against the percentage of favourable responses grouped by question
ggplot(attitude_tib, aes(x=value, y=rating)) + geom_point(aes(color=variable), position="jitter") + geom_smooth(aes(color=variable),method=lm, se=FALSE)+scale_color_manual(name="Question Topic",values=c("#032e70","#609af2","#3e81e8","#1855b2","#a0c3f7","#bcd3f4")) + xlab("% Favourable Response") + ylab("Overall Approval Rating") +ggtitle("Overall Rating against Percentage of Favourable Response \n Across Question Topics")

##create a scatterplot with regression lines of overall rating against the percentage of favourable responses grouped by question type (associated with positive of negative factors)
ggplot(attitude_tib, aes(x=value, y=rating)) + geom_point(aes(color=question_category), position="jitter") + geom_smooth(aes(color=question_category),method=lm, se=FALSE)+ xlab("% Favourable Response") + ylab("Overall Approval Rating") +scale_color_discrete(name="Question Topic Category")+ggtitle("Overall Rating against Percentage of Favourable Response \n Between Categories of Question Topics")

Summary:

The figures above explored the relationship between overall employee ratings of the organization with the percentage of favourable responses on each question in the survey. The percentage of favourable responses on each question are plotted against rating, with a line fitted for each question. Each line represents the relationship between the increase in rating and a percentage increase in favourable response for that question, and the steepness of the slopes show the strength of the relationship.

In the first figure, we can see that favourable opinion on how the organization handles complaints have the strongest positive relationship followed by opportunities for raises and learning. Whether managment is too critical and advancement opportunities have the weakest relationship with overall rating. If we group the questions into whether they are associated with postivie or negative aspects of the work place, with complaint, critical, and privileges being associated with negative factors. We see that the relationship between overall rating and percentage of favourable responses across the two groups are fairly similar since the lines are almost parallel. Therefore, it does not seem that postive or negative factors positively contribute to ratings more than the other.

Some further questions of interest could be whether the response to some of these questions correlate to each other and whether certain questions contribute to overal ratings more heavily in specific departments. These could help management of the organization target areas to improve on to increase employee approval.